{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP Information Extraction\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SparkContext and SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark import SparkContext\n",
    "sc = SparkContext(master = 'local')\n",
    "\n",
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession.builder \\\n",
    "          .appName(\"Python Spark SQL basic example\") \\\n",
    "          .config(\"spark.some.config.option\", \"some-value\") \\\n",
    "          .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple NLP pipeline architecture\n",
    "\n",
    "![](images/simple-nlp-pipeline.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reference:** Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. \" O'Reilly Media, Inc.\", 2009.\n",
    "\n",
    "## Example data\n",
    "\n",
    "The raw text is from the gutenberg corpus from the nltk package. The fileid is *milton-paradise.txt*.\n",
    "\n",
    "### Get the data\n",
    "\n",
    "#### Raw text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk.corpus import gutenberg\n",
    "\n",
    "milton_paradise = gutenberg.raw('milton-paradise.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a spark data frame to store raw text\n",
    "\n",
    "* Use the `nltk.sent_tokenize()` function to split text into sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+\n",
      "|           sentences|\n",
      "+--------------------+\n",
      "|[Paradise Lost by...|\n",
      "|And chiefly thou,...|\n",
      "|Say first--for He...|\n",
      "|Who first seduced...|\n",
      "|Th' infernal Serp...|\n",
      "+--------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pdf = pd.DataFrame({\n",
    "        'sentences': nltk.sent_tokenize(milton_paradise)\n",
    "    })\n",
    "df = spark.createDataFrame(pdf)\n",
    "df.show(n=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tokenization and POS tagging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import udf\n",
    "from pyspark.sql.types import *\n",
    "\n",
    "## define udf function\n",
    "def sent_to_tag_words(sent):\n",
    "    wordlist = nltk.word_tokenize(sent)\n",
    "    tagged_words = nltk.pos_tag(wordlist)\n",
    "    return(tagged_words)\n",
    "## define schema for returned result from the udf function\n",
    "## the returned result is a list of tuples.\n",
    "schema = ArrayType(StructType([\n",
    "            StructField('f1', StringType()),\n",
    "            StructField('f2', StringType())\n",
    "        ]))\n",
    "        \n",
    "## the udf function\n",
    "sent_to_tag_words_udf = udf(sent_to_tag_words, schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Transform data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+\n",
      "|        tagged_words|\n",
      "+--------------------+\n",
      "|[[[,JJ], [Paradis...|\n",
      "|[[And,CC], [chief...|\n",
      "|[[Say,NNP], [firs...|\n",
      "|[[Who,WP], [first...|\n",
      "|[[Th,NNP], [',POS...|\n",
      "+--------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_tagged_words = df.select(sent_to_tag_words_udf(df.sentences).alias('tagged_words'))\n",
    "df_tagged_words.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking\n",
    "\n",
    "Chunking is the process of segmenting and labeling multitokens. The following example shows how to do a noun phrase chunking on the tagged words data frame from the previous step.\n",
    "\n",
    "First we define a *udf* function which chunks noun phrases from a list of pos-tagged words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "from pyspark.sql.functions import udf\n",
    "from pyspark.sql.types import *\n",
    "\n",
    "# define a udf function to chunk noun phrases from pos-tagged words\n",
    "grammar = \"NP: {<DT>?<JJ>*<NN>}\"\n",
    "chunk_parser = nltk.RegexpParser(grammar)\n",
    "chunk_parser_udf = udf(lambda x: str(chunk_parser.parse(x)), StringType())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Transform data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_NP_chunks = df_tagged_words.select(chunk_parser_udf(df_tagged_words.tagged_words).alias('NP_chunk'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
      "|NP_chunk                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |\n",
      "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
      "|(S\n",
      "  [/JJ\n",
      "  Paradise/NNP\n",
      "  Lost/VBN\n",
      "  by/IN\n",
      "  John/NNP\n",
      "  Milton/NNP\n",
      "  1667/CD\n",
      "  ]/NNP\n",
      "  Book/NNP\n",
      "  I/PRP\n",
      "  Of/IN\n",
      "  Man/NNP\n",
      "  's/POS\n",
      "  (NP first/JJ disobedience/NN)\n",
      "  ,/,\n",
      "  and/CC\n",
      "  (NP the/DT fruit/NN)\n",
      "  Of/IN\n",
      "  (NP that/DT forbidden/JJ tree/NN)\n",
      "  whose/WP$\n",
      "  (NP mortal/JJ taste/NN)\n",
      "  Brought/NNP\n",
      "  (NP death/NN)\n",
      "  into/IN\n",
      "  the/DT\n",
      "  World/NNP\n",
      "  ,/,\n",
      "  and/CC\n",
      "  all/DT\n",
      "  our/PRP$\n",
      "  (NP woe/NN)\n",
      "  ,/,\n",
      "  With/IN\n",
      "  (NP loss/NN)\n",
      "  of/IN\n",
      "  Eden/NNP\n",
      "  ,/,\n",
      "  till/VB\n",
      "  one/CD\n",
      "  greater/JJR\n",
      "  (NP Man/NN)\n",
      "  Restore/NNP\n",
      "  us/PRP\n",
      "  ,/,\n",
      "  and/CC\n",
      "  regain/VB\n",
      "  (NP the/DT blissful/JJ seat/NN)\n",
      "  ,/,\n",
      "  Sing/NNP\n",
      "  ,/,\n",
      "  Heavenly/NNP\n",
      "  Muse/NNP\n",
      "  ,/,\n",
      "  that/WDT\n",
      "  ,/,\n",
      "  on/IN\n",
      "  (NP the/DT secret/JJ top/NN)\n",
      "  Of/IN\n",
      "  Oreb/NNP\n",
      "  ,/,\n",
      "  or/CC\n",
      "  of/IN\n",
      "  Sinai/NNP\n",
      "  ,/,\n",
      "  (NP didst/NN)\n",
      "  (NP inspire/NN)\n",
      "  That/WDT\n",
      "  (NP shepherd/NN)\n",
      "  who/WP\n",
      "  first/RB\n",
      "  taught/VBD\n",
      "  (NP the/DT chosen/NN)\n",
      "  (NP seed/NN)\n",
      "  In/IN\n",
      "  (NP the/DT beginning/NN)\n",
      "  how/WRB\n",
      "  the/DT\n",
      "  heavens/NNS\n",
      "  and/CC\n",
      "  (NP earth/NN)\n",
      "  Rose/NNP\n",
      "  out/IN\n",
      "  of/IN\n",
      "  (NP Chaos/NN)\n",
      "  :/:\n",
      "  or/CC\n",
      "  ,/,\n",
      "  if/IN\n",
      "  Sion/NNP\n",
      "  (NP hill/NN)\n",
      "  Delight/NNP\n",
      "  (NP thee/NN)\n",
      "  more/RBR\n",
      "  ,/,\n",
      "  and/CC\n",
      "  Siloa/NNP\n",
      "  's/POS\n",
      "  (NP brook/NN)\n",
      "  that/WDT\n",
      "  flowed/VBD\n",
      "  Fast/NNP\n",
      "  by/IN\n",
      "  (NP the/DT oracle/NN)\n",
      "  of/IN\n",
      "  God/NNP\n",
      "  ,/,\n",
      "  I/PRP\n",
      "  thence/VBP\n",
      "  Invoke/NNP\n",
      "  (NP thy/NN)\n",
      "  (NP aid/NN)\n",
      "  to/TO\n",
      "  my/PRP$\n",
      "  (NP adventurous/JJ song/NN)\n",
      "  ,/,\n",
      "  That/IN\n",
      "  with/IN\n",
      "  (NP no/DT middle/JJ flight/NN)\n",
      "  intends/VBZ\n",
      "  to/TO\n",
      "  soar/VB\n",
      "  Above/NNP\n",
      "  (NP th/NN)\n",
      "  '/''\n",
      "  (NP Aonian/JJ mount/NN)\n",
      "  ,/,\n",
      "  while/IN\n",
      "  it/PRP\n",
      "  pursues/VBZ\n",
      "  Things/NNP\n",
      "  unattempted/JJ\n",
      "  yet/RB\n",
      "  in/IN\n",
      "  (NP prose/NN)\n",
      "  or/CC\n",
      "  (NP rhyme/NN)\n",
      "  ./.)|\n",
      "|(S\n",
      "  And/CC\n",
      "  (NP chiefly/NN)\n",
      "  (NP thou/NN)\n",
      "  ,/,\n",
      "  O/NNP\n",
      "  Spirit/NNP\n",
      "  ,/,\n",
      "  that/IN\n",
      "  (NP dost/NN)\n",
      "  (NP prefer/NN)\n",
      "  Before/IN\n",
      "  all/DT\n",
      "  temples/NNS\n",
      "  th/VBP\n",
      "  '/''\n",
      "  (NP upright/JJ heart/NN)\n",
      "  and/CC\n",
      "  (NP pure/NN)\n",
      "  ,/,\n",
      "  Instruct/NNP\n",
      "  me/PRP\n",
      "  ,/,\n",
      "  for/IN\n",
      "  (NP thou/JJ know'st/NN)\n",
      "  ;/:\n",
      "  thou/CC\n",
      "  from/IN\n",
      "  the/DT\n",
      "  first/JJ\n",
      "  Wast/NNP\n",
      "  (NP present/NN)\n",
      "  ,/,\n",
      "  and/CC\n",
      "  ,/,\n",
      "  with/IN\n",
      "  mighty/JJ\n",
      "  wings/NNS\n",
      "  outspread/VBP\n",
      "  ,/,\n",
      "  Dove-like/NNP\n",
      "  (NP sat'st/NN)\n",
      "  brooding/VBG\n",
      "  on/IN\n",
      "  the/DT\n",
      "  vast/JJ\n",
      "  Abyss/NNP\n",
      "  ,/,\n",
      "  And/CC\n",
      "  mad'st/VB\n",
      "  it/PRP\n",
      "  pregnant/JJ\n",
      "  :/:\n",
      "  what/WP\n",
      "  in/IN\n",
      "  me/PRP\n",
      "  is/VBZ\n",
      "  dark/JJ\n",
      "  Illumine/NNP\n",
      "  ,/,\n",
      "  what/WP\n",
      "  is/VBZ\n",
      "  (NP low/JJ raise/NN)\n",
      "  and/CC\n",
      "  (NP support/NN)\n",
      "  ;/:\n",
      "  That/DT\n",
      "  ,/,\n",
      "  to/TO\n",
      "  (NP the/DT height/NN)\n",
      "  of/IN\n",
      "  (NP this/DT great/JJ argument/NN)\n",
      "  ,/,\n",
      "  I/PRP\n",
      "  may/MD\n",
      "  assert/VB\n",
      "  Eternal/NNP\n",
      "  Providence/NNP\n",
      "  ,/,\n",
      "  And/CC\n",
      "  justify/VB\n",
      "  the/DT\n",
      "  ways/NNS\n",
      "  of/IN\n",
      "  God/NNP\n",
      "  to/TO\n",
      "  men/NNS\n",
      "  ./.)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |\n",
      "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
      "only showing top 2 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_NP_chunks.show(2, truncate=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
